Lesson Objectives

At the end of this lecture you should be able to:
  1. Simulate error around a model
  2. Plot functions of two variables in R
  3. Use R to check your work in calculus
  4. Perform basic linear algebra in R

Resources

mosaicCalc in R: https://cran.r-project.org/web/packages/mosaicCalc/vignettes/Calculus_with_R.html

Simulating Data and Error

In previous classes we have talked about simulating data, a fundamental computational tool for examining most statistical and mathematical methodologies. When we simulate data, we assume some underlying model and then generate many data points under that assumption.

Here is a quick review of some of the helpful functions we’ve used in R so far to simulate data.

# Set a random seed for reproducibility
set.seed(9635)
# Generate a list of integers increasing by 1
x <- 1:10
# Generate a sequence of equally spaced numbers within a range 
x <- seq(from = 1, to = 5, by = 0.5)
x <- seq(from = 1, to = 5, length.out = 100)
# Generate a list of numbers following a random uniform distribution (i.e. equal probability between the min and max number)
x <- runif(n = 100,min = 0,max = 1)
hist(x)

# Generate a list of numbers following the normal distribution with a specified mean and sd
x <- rnorm(n = 100, mean = 0, sd = 1)
hist(x)

# Generate a list of integers following the binomial distribution with a given probability and number of trials
x <- rbinom(n = 100,size = 1,prob = 0.5)
hist(x)

You will learn more about different probability distributions in your statistics courses, but each probability distribution has a matching family of functions in R from which you can generate random values.
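Each distribution's family of functions shares a root name with one of four prefixes: r- for random generation (used above), d- for the density or mass function, p- for the cumulative probability, and q- for quantiles. A quick illustration with the normal and Poisson distributions:

```r
# r- prefix: random draws (here, 5 Poisson counts with mean 3)
set.seed(1)
rpois(n = 5, lambda = 3)

# d- prefix: density (or probability mass) at a point
dnorm(0, mean = 0, sd = 1)     # height of the standard normal curve at 0

# p- prefix: cumulative probability P(X <= q)
pnorm(0, mean = 0, sd = 1)     # 0.5, by symmetry of the normal

# q- prefix: quantile, the inverse of the p- function
qnorm(0.975, mean = 0, sd = 1) # ~1.96, the familiar 95% critical value
```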

Now, if we return to our previously defined function from last class: \[f(x)=x\sin\left(\frac{1}{x}\right)\] You’ll recall this function looks like this:

# Define our function
myFunction <- function(x) {
  x*sin(1/x)
}
# Set domain of x
x <- seq(from = -1,to = 1,length.out = 10000)
# Calculate f(x) over our domain of x
y <- myFunction(x)

# Plot the function over our domain of x
plot(x,y,type="l")

Real-world data never looks this clean; there is always additional noise on top of the simple modeled relationship. We often specify an explicit model for this noise as well.

Suppose, for instance, that we wanted to study the above relationship under different noise parameters. We would start by generating our y values over x, and then use the random-generation functions above to ‘add’ noise to the relationship. For instance:

# Calculate noise parameter following a uniform distribution
noise <- runif(n = length(x),min = 0,max = 0.05)
hist(noise)

# Re-calculate f(x) with a noise parameter
y <- myFunction(x) + noise

# Plot f(x) with noise
plot(x,y,type = "l")

Here, we added random noise to each point, with an error between 0 and 0.05 following a uniform distribution. We could do the same following a normal distribution:

# Calculate noise parameter using a normal distribution
noise <- rnorm(length(x),mean = 0,sd = 0.1)
hist(noise)

# Re-calculate f(x) with new noise parameter
y <- myFunction(x) + noise
# Replot
plot(x,y,type="l")

Here, we added noise with a mean of 0 and a standard deviation of 0.1, which generated more noise than above. If you increase the standard deviation, the amount of noise will also increase.
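One way to see this is to compare the spread of two simulated noise vectors directly; the sample standard deviation tracks whatever sd we request from rnorm():

```r
set.seed(9635)
small_noise <- rnorm(n = 10000, mean = 0, sd = 0.05)
large_noise <- rnorm(n = 10000, mean = 0, sd = 0.5)

# The sample sd of each vector lands close to the sd we requested
sd(small_noise)
sd(large_noise)

# A larger sd argument means noisier simulated data
sd(large_noise) > sd(small_noise)
```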

Calculus in R

As you’ve seen so far, graphing functions plays a huge role in calculus. Knowing the basic graphs, and understanding how they are transformed, is a key skill. See below for a solid cheat sheet.

Up until now, to plot we’ve generated values of x over some domain and then computed the matching values of y.

However, R’s go-to calculus package, mosaicCalc, gives us more options. We’ll start by installing and loading it.

#install.packages('remotes')
#remotes::install_github("ProjectMOSAIC/mosaicCalc", ref="beta") # must use this version for plots
library(mosaicCalc)

There are three plotting functions that are useful to us: slice_plot(), which works similarly to the plots we’ve made so far, contour_plot() which is for a function of two variables, and interactive_plot() which produces an HTML widget to interact with plots.

Let’s start with a basic straight line, corresponding to \[f(x)=2x+3\]

slice_plot(2 * x + 3 ~ x,domain(x = range(0,10)))

You can see that this plot did it all in one step: it took our function over x, and we could set the domain, all within one line of code. Note how the formula interface defines our function differently than a standard function() definition.

We can also use the makeFun() functionality for even more intuitive use:

# Prespecify our function
f <- makeFun(2 * x^2 - 5 * x + 2 ~ x)
# Plot function over specified domain
slice_plot(f(x) ~ x,domain(x = range(-2,2)))

And, as an added bonus, once a function is named this way in R, we can evaluate it using a given input:

f(x = 1)
## [1] -1

We can also apply functions to our function. Consider:

slice_plot(sqrt(abs(f(x))) ~ x,domain(x = range(-5,5)))

Suppose you have a function of two variables that you want to plot, such as \[ f(x,y) = \sin\left(\frac{2\pi y}{10}\right)e^{-2x}\]

To do this, you need to use the contour_plot() function.

f <- makeFun(sin((2*pi*y)/10)*exp(-2*x) ~ x & y)
contour_plot(f(x,y) ~ x & y,domain(x = 0:5,y = 0:10))

Note that this plot is actually occurring in 3D, and contour_plot() translates it to two dimensions for us. The plot can therefore be more easily interpreted in 3D using interactive_plot():

interactive_plot(f(x,y) ~ x & y,domain(x = 0:5,y = 0:10))
## Loading required namespace: plotly

Differentiation and Integration

A lot of statistical theory relies on differentiation and integration. A cheat sheet to some rules of derivatives can be found here https://tutorial.math.lamar.edu/pdf/calculus_cheat_sheet_derivatives.pdf.

While you may find yourself solving some derivatives with pen and paper this year, you can check your work within R. The D() function in the mosaic package takes a mathematical expression, a tilde, and the variable with respect to which the derivative should be taken. Of note, your statistics courses will cover the theory as needed, so here we’re really just focusing on showing you how you can use R as a tool to check your work.

D(x^2 + x + 3 ~ x)
## function (x) 
## 2 * x + 1

As you can see, this returns a function which is the derivative of the expression we fed it.

If we want to solve this derivative at various values of x:

g <- D(x^2 + x + 3 ~ x)
g
## function (x) 
## 2 * x + 1
g(5)
## [1] 11
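As an independent cross-check, base R ships its own symbolic differentiator, stats::D(), with no extra packages required. Its interface differs from mosaic’s: it takes an expression() and the variable name as a string, and returns an unevaluated expression rather than a function:

```r
# Base R symbolic differentiation; the stats:: prefix avoids mosaic's masking of D
dx <- stats::D(expression(x^2 + x + 3), "x")
dx
# 2 * x + 1

# Evaluate the derivative at x = 5 with eval()
x <- 5
eval(dx)
# [1] 11
```

This matches g(5) from the mosaic version above.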

As you can imagine, this becomes very beneficial for complicated derivatives. Consider:

g <- D(sin(x^2) * cos(x)/tan(x) ~ x)
g
## function (x) 
## {
##     .e1 <- cos(x)
##     .e2 <- tan(x)
##     .e3 <- x^2
##     (2 * (x * .e1 * cos(.e3)) - (1/(.e1 * .e2) + sin(x)) * sin(.e3))/.e2
## }
g(20)
## [1] -3.068916

In the above case, R has returned the symbolic derivative with repeated subexpressions (.e1, .e2, .e3) factored out for efficiency; calling g(20) then evaluates that expression numerically.

You can also easily solve second derivatives. One option is to:

d1 <- D(sin(x) ~ x)
d2 <- D(d1(x) ~ x)
d1
## function (x) 
## cos(x)
d2
## function (x) 
## -sin(x)

Or, in shorthand form:

D(sin(x) ~ x & x)
## function (x) 
## -sin(x)

From here, you might expect that we can also compute partial derivatives, differentiating with respect to multiple variables in sequence. Below we differentiate first with respect to x and then with respect to y:

pxy <- D(2*x + x*sin(y^2) ~ x & y)
pxy
## function (x, y) 
## 2 * y * cos(y^2)

Integration

Integration is the anti-derivative: when we integrate, we start with the derivative and aim to “undo” the differentiation to recover the original function.

In R, we can use the antiD function to do this!

Suppose we have:

f <- makeFun(2*x^2 + 3*x ~ x)
df <- D(f(x) ~ x)
df
## function (x) 
## 4 * x + 3
antiDF <- antiD(4*x + 3 ~ x)
antiDF
## function (x, C = 0) 
## 2 * x^2 + 3 * x + C

We can plot these using slice_plot():

slice_plot(df(x) ~ x,domain(x = -1:1))

slice_plot(antiDF(x) ~ x,domain(x = -1:1))

Recall that an anti-derivative is only defined up to an additive constant, so there are many possibilities. By default, the function R returns uses 0 as the constant (note the C = 0 argument in the output above).

If we want to calculate the definite integral over a given domain, we simply evaluate the anti-derivative at the endpoints and subtract. For example, if we wanted to assess the above integral over our domain of -1 to 1, we would have:

\[ \int_{-1}^{1} 4x + 3 \,dx \]

antiDF <- antiD(4*x + 3 ~ x)
antiDF(1) - antiDF(-1)
## [1] 6
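Base R’s integrate() offers a numerical cross-check of the same definite integral, with no anti-derivative required:

```r
# Numerically integrate 4x + 3 from -1 to 1
result <- integrate(function(x) 4 * x + 3, lower = -1, upper = 1)
result$value
# 6, matching antiDF(1) - antiDF(-1) above
```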

Linear Algebra

Linear algebra is an important discipline in any kind of computation, and hence, R is well suited to evaluate traditional linear algebra problems. We’ve already been setting vectors, creating matrices, etc. throughout this course, so you’re well versed in the basics of linear algebra in R. However, it is useful to review some of the other linear algebra functionality.

Transposing a matrix

We briefly discussed matrix transposition earlier in the semester when we talked about correlation matrices. As a refresher, the transpose of a matrix is found by swapping its rows and columns, as shown in the diagram below.

In R, we can use the built-in function t() to find the transpose of a matrix.

M <- matrix(1:20,nrow = 5)
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
t(M)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20

Matrix Multiplication

Recall that if A and B are both matrices, then to multiply A and B we follow the diagram below:

We multiply the rows of A along the columns of B, summing the products of corresponding entries; each resulting cell is a dot product. Importantly, the dimensions must conform: an m x n matrix can only be multiplied by an n x p matrix.

In R, we calculate this using the %*% syntax. For example:

A <- matrix(runif(100,0,100),ncol = 10)
B <- A + rnorm(10)
A %*% B
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
##  [1,] 32136.94 33891.00 31745.15 36511.05 32951.96 33224.17 49611.22 33238.41
##  [2,] 16466.70 23928.09 13953.74 23443.48 21179.66 19733.53 29658.81 26004.82
##  [3,] 27823.18 33007.67 27996.56 35785.69 28055.31 31985.82 43033.76 38662.76
##  [4,] 26225.22 23565.65 29114.46 28683.98 21958.37 25918.88 35291.10 25018.91
##  [5,] 31443.20 28442.31 30184.65 32738.56 26747.54 34014.16 36722.90 31523.79
##  [6,] 25744.02 28200.98 22911.27 30854.96 23431.85 28125.93 41337.11 33149.99
##  [7,] 23749.92 22473.68 23633.24 26963.77 19977.45 22866.90 33391.08 21175.41
##  [8,] 28668.35 28716.50 28758.95 27858.52 29353.63 29813.47 38436.98 29297.03
##  [9,] 28210.66 26173.96 27315.33 29721.80 22642.51 29420.33 38640.92 29919.86
## [10,] 13110.11 14277.94 15544.86 18010.21 15339.94 11644.11 22980.96 13479.27
##           [,9]    [,10]
##  [1,] 20997.26 29438.88
##  [2,] 14128.25 21284.09
##  [3,] 18388.36 29567.85
##  [4,] 15433.17 22606.68
##  [5,] 14477.21 28497.84
##  [6,] 17405.62 24207.20
##  [7,] 15635.44 21307.52
##  [8,] 17083.81 23981.23
##  [9,] 15898.66 22535.10
## [10,] 11459.81 17273.12
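As a quick sanity check of the conformability rule, a 2 x 3 matrix times a 3 x 2 matrix yields a 2 x 2 result, while multiplying two 2 x 3 matrices throws a non-conformable arguments error (the matrices P and Q here are purely illustrative):

```r
P <- matrix(1:6, nrow = 2)  # a 2 x 3 matrix
Q <- matrix(1:6, nrow = 3)  # a 3 x 2 matrix

dim(P %*% Q)  # 2 2 -- the inner dimensions (3 and 3) match
# P %*% P     # would error: non-conformable arguments
```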

Diagonals

The diagonal of a matrix is often a special case that we want to extract. To do this, we can use the diag() function in R. However, depending on what object we feed to diag(), we receive very different output.

For instance, if we feed diag() a vector object, v, it will return a matrix with v on the diagonal.

v <- 1:5
diag(v)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    2    0    0    0
## [3,]    0    0    3    0    0
## [4,]    0    0    0    4    0
## [5,]    0    0    0    0    5

If we feed diag() a matrix M, it will return the diagonal entries of M:

M <- matrix(1:25,ncol = 5)
M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
diag(M)
## [1]  1  7 13 19 25

If we feed diag() a constant k, it will return a k by k identity matrix:

diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

For those of you new to linear algebra, note that an identity matrix is characterized by being square, having 1’s on the diagonal, and 0’s otherwise.
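The identity matrix earns its name: multiplying any conformable matrix by it returns that matrix unchanged, which we can verify directly:

```r
M <- matrix(1:25, ncol = 5)
I <- diag(5)

# Multiplying by the identity leaves M unchanged on either side
all(M %*% I == M)  # TRUE
all(I %*% M == M)  # TRUE
```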

Sometimes we may be interested in extracting either the lower or upper triangular part of a matrix, as in the following graphic:

And we may elect to include the diagonal as part of either triangle (or not), as the application requires.

In R, this is done by using a logical matrix as an index:

M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
a <- upper.tri(M, diag = FALSE)
a
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE  TRUE  TRUE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE FALSE  TRUE  TRUE
## [4,] FALSE FALSE FALSE FALSE  TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE
M[a]
##  [1]  6 11 12 16 17 18 21 22 23 24
ifelse(a, M, 0)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    6   11   16   21
## [2,]    0    0   12   17   22
## [3,]    0    0    0   18   23
## [4,]    0    0    0    0   24
## [5,]    0    0    0    0    0

Solving Linear Equations

Matrix algebra is fundamental to solving linear equations.

Suppose we have the following equation:

\[ b = A \cdot x\]

where b and x are vectors and A is a matrix. If we know only b and A, we can use R’s solve() function to find x.

A <- diag(1:5)
b <- 6:10
solve(A,b)
## [1] 6.000000 3.500000 2.666667 2.250000 2.000000

So \(x=(6,3.5,2.66667, 2.25,2)\).

We can also use linear algebra to solve systems of equations:

Suppose we have the following equations:

\[ x + 5y = 1 \\ 2x -2y = 1\\ 4x + 0y = 1 \]

In terms of vectors, we can rewrite this as:

\[ x \begin{pmatrix} 1 \\ 2\\4 \end{pmatrix} + y \begin{pmatrix} 5 \\ -2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \]

Let’s call the vectors on the LHS v and v2, and the vector on the RHS b:

v <- c(1,2,4) 
v2 <- c(5,-2,0)
b <- c(1,1,1)

We can then solve this system of equations by projecting \(b\) onto the space spanned by \(v\) and \(v_2\).

library(mosaic)
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
mosaic::project(b ~ v + v2)
##          v         v2 
## 0.32894737 0.09210526

Hence, \(x=0.3289\) and \(y=0.092\).

Notice that these values do not ‘exactly’ solve the linear system; rather, project() returns the least-squares solution, the values that come closest to satisfying all three equations, via numerical methods.
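The same least-squares answer can be obtained in base R with qr.solve(), which solves overdetermined systems via the QR decomposition (an alternative to mosaic::project()):

```r
v <- c(1, 2, 4)
v2 <- c(5, -2, 0)
b <- c(1, 1, 1)

# Least-squares solution to cbind(v, v2) %*% c(x, y) = b
qr.solve(cbind(v, v2), b)
# v is approximately 0.3289 and v2 approximately 0.0921, matching project()
```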

Eigenvalues and Eigenvectors

One of the most common uses of linear algebra in statistics and data science is finding eigenvalues and eigenvectors. Eigen is a German word meaning ‘own’, and in English we often render it as “characteristic”. So when we find eigenvectors, we are looking for the characteristic vectors of a matrix, which preserve information about the matrix in lower dimensions.

This concept is used all the time. It’s the foundation of principal component analysis (PCA). We use the concept of an ‘eigengene’ in a pathway-based network analysis of genetic data called Weighted Gene Correlation Network Analysis (WGCNA). Eigenvectors are also a foundational idea behind Google’s PageRank algorithm. You’ll see this concept over and over again.

We can use the eigen() function in R to easily obtain these values.

A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    2    0    0    0
## [3,]    0    0    3    0    0
## [4,]    0    0    0    4    0
## [5,]    0    0    0    0    5
eigen(A)
## eigen() decomposition
## $values
## [1] 5 4 3 2 1
## 
## $vectors
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    1
## [2,]    0    0    0    1    0
## [3,]    0    0    1    0    0
## [4,]    0    1    0    0    0
## [5,]    1    0    0    0    0
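It is worth verifying the defining property of an eigenpair, \(Av = \lambda v\), using any column of the returned matrix:

```r
A <- diag(1:5)
e <- eigen(A)

# Check A %*% v = lambda * v for the first eigenpair
v1 <- e$vectors[, 1]
lambda1 <- e$values[1]
all.equal(as.vector(A %*% v1), lambda1 * v1)  # TRUE
```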

You can also easily compute the determinant of a matrix in R, to check any work you may be doing by hand when calculating eigenvalues:

A
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    2    0    0    0
## [3,]    0    0    3    0    0
## [4,]    0    0    0    4    0
## [5,]    0    0    0    0    5
det(A)
## [1] 120
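For a diagonal matrix the determinant is just the product of the diagonal entries, and in general it equals the product of the eigenvalues; both agree with the 120 above:

```r
A <- diag(1:5)

det(A)                 # 120
prod(diag(A))          # 1 * 2 * 3 * 4 * 5 = 120
prod(eigen(A)$values)  # also 120
```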

Singular Value Decomposition

Singular value decomposition, or SVD, is another popular linear algebra technique commonly used in data science (particularly machine learning). SVD factorizes a matrix into the product of three matrices. Similarly, we can easily obtain this in R using the svd() function.

M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25
svd(M)
## $d
## [1] 7.425405e+01 3.366820e+00 2.421025e-15 3.931886e-16 1.737407e-16
## 
## $u
##            [,1]       [,2]       [,3]        [,4]       [,5]
## [1,] -0.3926228 -0.6677180  0.6152910 -0.09083491 -0.1147432
## [2,] -0.4191312 -0.3526033 -0.6493214  0.41570434 -0.3249178
## [3,] -0.4456396 -0.0374885 -0.3532681 -0.71115382  0.4116575
## [4,] -0.4721479  0.2776263  0.1933366  0.53853426  0.6104112
## [5,] -0.4986563  0.5927410  0.1939620 -0.15224987 -0.5824077
## 
## $v
##             [,1]        [,2]       [,3]        [,4]       [,5]
## [1,] -0.09359323  0.76892152 -0.5097751  0.04655357 -0.3714325
## [2,] -0.24363203  0.49055421  0.3403498  0.38809862  0.6584386
## [3,] -0.39367083  0.21218690  0.6867977 -0.35622289 -0.4487918
## [4,] -0.54370962 -0.06618041 -0.3555445 -0.63806433  0.4079976
## [5,] -0.69374842 -0.34454772 -0.1618280  0.55963504 -0.2462119
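The three factors multiply back to the original matrix, \(M = U \,\mathrm{diag}(d)\, V^{T}\); we can confirm the factorization reconstructs M up to floating-point error:

```r
M <- matrix(1:25, ncol = 5)
s <- svd(M)

# Rebuild M from its factors U, diag(d), and t(V)
M_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(M - M_rebuilt))  # effectively zero
```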